Search in the Catalogues and Directories

Hits 1 – 20 of 63

1
Assessing the impact of OCR noise on multilingual event detection over digitised documents
In: ISSN: 1432-5012 ; EISSN: 1432-1300 ; International Journal on Digital Libraries ; https://hal.archives-ouvertes.fr/hal-03635985 ; International Journal on Digital Libraries, Springer Verlag, 2022, ⟨10.1007/s00799-022-00325-2⟩ (2022)
BASE
2
Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents ...
BASE
3
Assessing the Impact of OCR Noise on Multilingual Event Detection over Digitised Documents ...
BASE
4
L3i_LBPAM at the FinSim-2 task: Learning Financial Semantic Similarities with Siamese Transformers
In: WWW '21: Companion Proceedings of the Web Conference 2021 ; WWW '21: The Web Conference 2021 ; https://hal.sorbonne-universite.fr/hal-03256324 ; WWW '21: The Web Conference 2021, Apr 2021, Ljubljana (virtual), Slovenia. pp.302-306, ⟨10.1145/3442442.3451384⟩ (2021)
BASE
5
Discovering Spatial Relations in Litterature: what is the influence of OCR noise ?
In: NewsEye’s international conference ; https://hal.archives-ouvertes.fr/hal-03199729 ; NewsEye’s international conference, Mar 2021, Paris, France (2021)
Abstract: Digital Humanities methods enable the exploration and exploitation of digitized corpora at unprecedented scales. They also allow for refined research at several levels of granularity, from syntactic or hermeneutic perspectives, or through the identification of geographical named entities, which allows us to observe the evolution of language and its territorial distribution. However, there are notable limitations in the performance of Named Entity Recognition (NER) tools for humanities research due to the variability of the input data (linguistic, diachronic, diatopic variability). Moreover, this lack of robustness to variation is particularly striking when dealing with literary corpora, even more so when it involves early modern texts. The correct recognition of named entities is correlated with the training of the language model implemented in the NER system. Language models are usually trained on so-called “clean data” (assembled under optimal laboratory conditions) and for application to a specific corpus, which limits their generalizability to other corpora. Moreover, language models for early modern texts often require access to large corpora which have previously been transcribed using OCR. The quality of these transcriptions remains the subject of many current research projects [Baledent et al., 2020]. In essence, the malfunctioning of NER tools is attributed, on the one hand, to the quality of the transcriptions provided as input and, on the other hand, to the fact that the corpus being processed does not correspond to the corpus on which the language model was trained. To overcome the problem of OCR transcript quality, users resort to cleaning the transcribed text, a strategy that is costly in both time and money.
Indeed, any number of errors can exist in OCR transcriptions [Stanislawek et al., 2019], and this search for perfection, though perhaps feasible on a very small corpus, can be never-ending and represents a considerable expenditure of time at larger scales. Our project seeks to evaluate out-of-the-box NER tools, specifically spaCy, on minimally corrected OCR transcriptions. This experiment should allow us to see the capacity of these tools to do their work outside of ideal laboratory conditions, getting closer to everyday use of these tools, i.e. by a user who has neither the time nor the money for corrections but nevertheless seeks actionable results. Given this tension between the ideal and the real, we have for the moment eschewed any ground truth, which is costly to produce. Nevertheless, we use what we consider to be a reference text. The reference texts are extracted from ELTeC, a multilingual European Literary Text Collection in which entire novels are available in a standardized version. The texts we use in hypothesis-testing consist of the OCR transcription of the same texts, downloaded in PDF format from the Gallica website. The first novel on which we focus is Marguerite Audoux’s Marie-Claire (1910), a novel of about 34,500 words. We carried out initial tests on short text extracts of about words and found that the pre-trained spaCy models are capable of recognising a number of terms even when roughly transcribed by the OCR tool. The “fr_core_news_sm” model finds 79% of entities present in both the reference and the hypothesis text, and 12.5% of entities which are incorrectly spelled in the hypothesis text.
Keyword: [SHS]Humanities and Social Sciences
URL: https://hal.archives-ouvertes.fr/hal-03199729
BASE
6
Multilingual Epidemic Event Extraction
In: Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings ; https://hal.archives-ouvertes.fr/hal-03480551 ; Hao-Ren Ke; Chei Sian Lee; Kazunari Sugiyama. Towards Open and Trustworthy Digital Societies. 23rd International Conference on Asia-Pacific Digital Libraries, ICADL 2021, Virtual Event, December 1–3, 2021, Proceedings, 13133, Springer, pp.139-156, 2021, Lecture Notes in Computer Science, 978-3-030-91668-8. ⟨10.1007/978-3-030-91669-5_12⟩ (2021)
BASE
7
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie
In: COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference ; https://hal.archives-ouvertes.fr/hal-03320343 ; COnférence en Recherche d'Informations et Applications - CORIA 2021, French Information Retrieval Conference, Apr 2021, Grenoble (virtuel), France (2021)
BASE
8
« Exploiter un corpus de données textuelles sans post-traitement : l’écriture burlesque de la Fronde »
In: ISSN: 2736-2337 ; Humanités numériques ; https://hal.archives-ouvertes.fr/hal-03500616 ; Humanités numériques, Bruxelles: Humanistica, 2021 (2021)
BASE
9
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie ...
BASE
10
Multilingual Epidemic Event Extraction ...
BASE
11
Impact Analysis of Document Digitization on Event Extraction ...
BASE
12
Token-level Multilingual Epidemic Dataset for Event Extraction ...
BASE
13
Impact Analysis of Document Digitization on Event Extraction ...
BASE
14
Token-level Multilingual Epidemic Dataset for Event Extraction ...
BASE
15
Multilingual Epidemic Event Extraction ...
BASE
16
Étude comparative de méthodes de classification multilingue appliquées à l'épidémiologie ...
BASE
17
Multilingual Epidemiological Text Classification: A Comparative Study ...
BASE
18
Multilingual Epidemiological Text Classification: A Comparative Study ...
BASE
19
SinNer@Clef-Hipe2020 : Sinful adaptation of SotA models for Named Entity Recognition in French and German
In: CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum ; https://hal.inria.fr/hal-02984746 ; CLEF 2020 Working Notes. Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum, Sep 2020, Thessaloniki / Virtual, Greece ; https://impresso.github.io/CLEF-HIPE-2020/ (2020)
BASE
20
Daniel@FinTOC’2 Shared Task: Title Detection and Structure Extraction
In: 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation @COLING’2020 ; 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation @COLING’2020 ; https://hal.archives-ouvertes.fr/hal-03024867 ; 1st Joint Workshop on Financial Narrative Processing and MultiLing Financial Summarisation @COLING’2020, Dec 2020, Barcelone, Spain (2020)
BASE
